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PARAMETRIC STATE SPACE STRUCTURING 

GIANFRANCO CIARDO* AND MARCO TILGNER* 

Abstract. Structured approaches based on Kronecker operators for the description and solution of 
the infinitesimal generator of a continuous-time Markov chains are receiving increasing interest. However, 
their main advantage, a substantial reduction in the memory requirements during the numerical solution, 
comes at a price. Methods based on the “potential state space” allocate a probability vector that might be 
much larger than actually needed. Methods based on the “actual state space”, instead, have an additional 
logarithmic overhead. We present an approach that realizes the advantages of both methods with none of 
their disadvantages, by partitioning the local state spaces of each submodel. We apply our results to a model 
of software rendezvous, and show how they reduce memory requirements while, at the same time, improving 
the efficiency of the computation. 

Key words. Kronecker algebra, Markov chains, structured state space. 

Subject classification. Computer Science 

1. Introduction. Markov chains can be used to model complex systems exhibiting stochastic behavior, 
but their numerical solution is limited by the high memory requirements. For continuous-time Markov chains 
(CTMCs) in particular, three large data structures need to be stored in a conventional solution approach: 
the state space T, the transition rate matrix R, and the desired solution vector (e.g., the steady-state 
probabilities 7 r). Of these, R is the largest even when using sparse storage, since it contains as many 
nonzero entries as the number of possible state-to-state transitions. 

Thus, much work has been performed in the last decade on techniques to store R implicitly using 
Kronecker operators, starting with the “synchronized” stochastic automata of Plateau [15, 16, 19]. Buchholz 
[1] used a similar idea for Markovian closed (asynchronous) queueing networks, and Takahashi [22] used it for 
open queueing networks with communication blocking. Donatelli [10, 11] adapted the approach to generalized 
stochastic Petri nets (GSPNs), introducing the “superposed GSPNs (SGSPNs)”, further extended by Kemper 
[ 12 ]- 

In these approaches, K submodels are “composed” through some rules. For example, in the SGSPNs, 
individual submodels having an underlying ergodic CTMC are composed by merging transitions in two or 
more submodels, so that they conceptually fire at the same time. The state space X of the overall model is a 
(possibly proper) subset of the cross-product T of the state spaces of the individual submodels. Analogously, 
the matrix R is a submatrix of a matrix R obtained by combining a set of small matrices that describe the 
effect of an event on an individual submodel, using Kronecker products and sums. 

The possibility that the “potential” state space T might contain states not in the “actual” state space 
T has been addressed in two fundamentally different ways: 

• One can ignore the distinction between T and T, and write algorithms that use a vector n of size 
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|T| instead of a vector n of size T, and iterate on R [15, 16, 19, 1 , 17, 21]. This approach is simpler 
and does not require to store the state space T, but it might unnecessarily fail for lack of memory 
when |T| » |T|. 

• If a vector n of size |T| is used instead, the iterations of the numerical methods must be restricted 
to operate on the correct submatrix R of R. This can be done by explicitly storing the state space 
T , and then using the entry R, j computed through Kronecker operations, if, and only if, both i 
and j are reachable states in T [ 12 , 24, 9]. This involves searching for states in the data structure 
storing T, resulting in an additional logarithmic overhead. 

In this contribution, we explore a third possibility, which in the best case achieves the performance of 
the simpler methods based on T while using even less memory than those based on T. The approach uses a 
partition of each local state space in such a way that either T or a superset T of it (hopefully much smaller 
than T) can be encoded in a negligible amount of space. Furthermore, the question “is this state reachable” 
can be answered in 0{K) time, that is, by simply looking at its components, without having to perform a 
logarithmic-time search. 

While finding the “best partition” that will minimize the memory requirements is unfeasible in general, 
we show how the presence of invariants in the model can be used to guide us toward a “good partition” that 
can be used by our algorithm. 

The resulting block-partitioning of the matrix R is analogous to that introduced by Buchholz [ 2 , 4] for 
the solution of “asynchronous systems”, essentially restricted GSPN models composed of submodels where 
the interactions are not due to merging transitions but to a transition having input (s) in one submodel and 
output(s) in another. Also related to our work is a recent contribution by Campos, Silva, and Donatelli [5] 
on “stochastic deterministically synchronized sequential processes” . 

Our approach, however, is not restricted to a particular pattern of interaction between submodels, and 
can cope with “imperfect” partitioning, which might result in the most efficient approach in practice, as we 
show in our example. 

The paper is structured as follows. Section 2 describes the notation used and recalls the definition 
of Kronecker product and sum. Section 3 defines high-level models and their underlying CTMCs. Then, 
Section 4 describes how both the state space and the transition rate matrix can be described in a structured 
way from information “local” to each submodel, and recalls the main ideas of the potential and actual 
state-space- based solution approaches. Sections 5 and 6 introduce our main idea, the partition of local state 
spaces, and discuss how to find a good partition in practice. The effect of applying the partition to R is 
discussed in Section 7, while the resulting equations for the solution of the CTMC are derived in Section 
8 . Finally, Section 9 uses the new technique to study a software tasking system and details the timing and 
memory requirements under alternative solution approaches. 

2 . Notation. Except for the sets of natural numbers, iV = {0,1,2...}, and real numbers, JR , all 
sets are denoted by upper-case calligraphic letters (e.g., X); (row) vectors and matrices are denoted by 
lower- and upper-case bold letters, respectively (e.g., x, A); their entries are denoted by subscripts (e.g., x*, 
A, ; a set of indices can be used instead of a single index, for example, A x,y denotes the submatrix of 
A corresponding to set of rows X and the set of columns y. We also index families of like-quantities with 
subscripts or superscripts (e.g., Xi or x l ) and use a shorthand “range” notation to indicate sequences of them 
(e.g., n j — x‘i, . . . , Xfi) 

77 (A) denotes the number of nonzeros in matrix A. 0 nXm and l RX m denote matrices with n rows and 
m columns, having all entries equal 0 or 1 , respectively, while I n denotes the identity matrix of size n x n; 
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the dimensions of these matrices may be omitted when they are clear from the context. Given a vector x, 
diag(x) is a square matrix having vector x on the diagonal and zero elsewhere. 

Given a n x n matrix A, rwsm(A) = diag( A • l n xi) is a n x n matrix having the diagonal equal to the 
sums of the entries on each row of A, and zero elsewhere. 

We recall the definition of the Kronecker product A = <S)fc=i of K rectangular matrices A k G R rh XCk , 
using the convention that the rows and columns of both A and the matrices A k are indexed starting at 0. 
Let n} 1 = nL, n k- ^ we assume a mixed-base numbering scheme, the tuple i[\ y K] corresponds to the row 
index (. . . ((n)r 2 + z 2 )^3 * * )tk + in = * k ■ r k+i, and vice versa; the interpretation of a column index 

j[\,K] is analogous, except that c.k must be used instead of r* . Then, the generic element of A 6 1R 7 ^ xr * is 
given as 


( 2 . 1 ) 


A • — A 

***,3 — ^[l.KlJll ,K] 


- a 1 

_ A ii,h 


A 2 • ■ • A k 

* 2, .72 IK,JK ‘ 


The Kronecker sum A k is defined, only for square matrices A k G iR n * xn *, in terms of Kronecker 

products, as 


K 

^ ^ In i & ' ' ' & In*, i & A & Irifc+i & 1 ‘ ‘ & In/v • 

k~ 1 

We are interested in algorithms that exploit sparsity. For the Kronecker product, the number of nonzeros 
is simply v(<S>k=i A fc ) = IlLi »?(A fc ). 

Table 2.1 summarizes the symbols used in the paper. 

3. Description and solution of a Markov model. We assume that the Markov model is expressed 
in a structured high-level formalism as a collection of K submodels rrqi ^j. These define: 

• A set of potential states, T = xjf =1 T k , where T k = {0, . . . , n* - 1} is the set of rik potential local 
states for rn^. 

• A set of events, €. 

• An initial state, ijj G T. 

• Rules to define whether an event e G £ is active in a state i[\,K] € T, G {True, False], 

• its rate of occurrence when active, r<(e, i[i,/c|) > 0, 

• the state obtained when e occurs in state i[ i>K \, nettle, € T. 

From these, we can build: 

• A set of reachable states having an exponentially distributed holding time, or state-space, T, and 
the sets of local state-spaces T k , simply the projection of T on its k - th component. Without loss of 
generality, we assume from now on that T k = T k . 

• A transition rate matrix, R G 2Rl T l x l T l, describing the transition rates between reachable states. 

• An initial probability vector, 7r(0) G IR) r \. 

T can be generated using a state-space exploration algorithm, such as BuildRS , shown in Fig. 3.1, 
which terminates iff T is finite. T and U are the sets of states explored so far, or found but not yet explored, 
respectively. If U is managed as a FIFO linked list, BuildRS performs a breadth-first search of the graph 
implicitly defined by the model. The matrix R can be generated at the same time, or in a second pass, once 
|T| and r;(R) are known, so that an efficient sparse row-wise or column-wise format can be used [14]. 

The solution algorithms we discuss index the states according to “lexicographic order”, # : T — ► 
{0, . . . , |T| — 1}, such that > Vt'Op,/© iff follows j[i,K] m a lexicographic comparison. This 
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Table 2.1 

Symbols used in the paper 


Symbol 

Meaning 

K 

Number of submodels 

TUk 

k- th submodel 

rik 

Size of potential local state space for nik 

n F 

IK-int 

f,T 

Potential, actual state space 

fk T k 

Potential, actual local state space for m* 


Lexicographic position of i[i,K] 

£ ,£ k ,£* 

Events: overall, local to m^, synchronizing 

R,R 

Transition rate matrix 

R fc 

Local transition rate matrix for rrik 

W fc ’ 1 

Weight matrix for synchr. event e/ on m k 

7T, 7T 

Steady state probability vector 

h, h 

Expected holding time vector 

Pk 

Number of classes in the partition of T k 


{0, . . . , Pk - 1}, class indices 

'jk-.p 

p- th class in the partition, p G V k 

Tlk:p 

|T fc:p |, size of the p-th class 

n 

Relation defined on the sets V k 

B 

Blocks defining R 


BuildRS( in: £ ,act,new\ out: T); 

1. T 0; 

2. u {$, *,}; 

3. while U ^ 0 

4. “remove the first state ^ rom ; 

5. T TU{t [lf * 3 }; 

6. for each e* € £ s.t. act(ei,i[i y K\) 

7- j[i,K] new(e l: i [liK ]); 

8- if j[i y K] &T\JU then 

9. u wu{i[i,/c]}; 

Fig. 3.1. Algorithm BuildRS 

assumes that states (or their encodings) are somehow comparable. Furthermore, we extend *i> so that, 
Vi (1 ,Ki £ f\T, $(i[i,icj) = null. 

We focus on the computation of the steady-state probability (row) vector 7r G JR |t| satisfying 
(3.1) tt(R — rwsm (R)) = 0 1x | T) and 7T-1 | T | X i = 1, 

but the techniques we discuss are equally applicable to the computation of transient and cumulative measures 
[6]. We note that the indices of 7 r and R correspond to the reachable states through the mapping T*, that 
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is, the steady-state probability of state £ T is stored in position ^(i[i,/c]) of n. 

There are two difficulties in solving the system of linear equations (3.1). First, models of interest can 
describe CTMCs of enormous size, preventing us from using direct methods such as Gaussian elimination, 
which cause excessive fill-in. We are then limited to iterative methods using sparse storage schemes for Ft, blit 
even these methods are memory-bound when applied to realistic examples. Furthermore, the convergence 
rate of iterative methods can be quite poor when applied to stiff models, such as those arising in reliability 
analysis. 

In the following section, we recall results from [8, 9, 3] showing how the state-space T and the matrix 
R can be represented in compact form using Kronecker operators, thus reducing the storage requirements, 
but possibly at the cost of execution efficiency. Then, we show how, by an appropriate partition of the local 
state-spaces, we can achieve better performance, while further reducing the storage requirements. 

4. Structured description of T and R. In [8], we showed how T can be stored in an efficient 
dynamic multi-tree data structure during the execution of BuildRS. After T is known, this data structure 
can be further compressed by using arrays instead of dynamically balanced search trees, as shown in Fig. 
4.1. 

Using the data structure in Fig. 4.1 to compute ^{i[i y K\) is straightforward: 

1. Use binary search to find i\ in the array for m\ and follow the pointer to the array for m2. 

2. Search i 2 in the array for m 2 (only the grey portion of size < 712 between the pointed position and 

the next pointed position must be considered). If found, follow the pointer to the array for m3. 

3. Continue to follow the pointers until either a local state ik is not found (implying that i[i ,K) 4 T), 

or until Ik is found (implying that £ T and that its lexicographic position is the 

same as the position where i k was found in the array for mx). 

Assuming that the portions of the arrays searched at each level are of comparable size, the overall 
complexity to compute ^(f(i,K]) is 0(log|T|), the same as if T were stored in a single search tree or in a 
contiguous array. However, there are several advantages: 

• The memory requirements to store T using this data structure can be as low as 0([log7i/c] ■ T) 
bits, as long as, for any sub-state i[i,ic— 1]» the sets of local states {i K € T K : i[i,x) £ T} are either 
empty or reasonably large. In Fig. 4.1, this requirement implies that the last array, for mx, should 
be substantially larger than the array for ttik-u and so on. The straightforward data structure 
requires instead 0(53fcLi [logn^] • T) bits. 

• Only local states are compared at each step, not global states as in the straightforward algorithm. 

• It is often possible to realize that i[\,x] ^ T without having to consider all its components (on the 
other hand, it is impossible to determine that i[i,x) £ T before examining the array for m#)- 

• After searching for i[\ s K)y & search for a state with the same first k components can reuse the work 
already performed to find the path corresponding to ijijq, and start directly at the array for m*+ 1. 
This is the key to lowering the complexity of the vector-matrix multiplications presented in [3] . 

In practice, our implementation automatically uses 8-, 16- or 32-bit indices for the local states of m*, 
depending on the value of njt . For most models, nx < 2 8 or 2 16 , hence we can store T in little over |T| 01 
2|T| bytes using the data structure of Fig. 4.1. 

Following [9, 3], we assume that the activation, rate, and effect of an event can be determined in a 
“distributed fashion”. In other words, this implies that the effects of the various submodels on the event are 
independent of each other. Formally, given an event ei £ 8: 




FlCi. 4.1. Efficient storage ofT. 


• ei is active in i[i t K] iff it is “locally active” in each local state: 

(4.1) act(ei, i[i #]) = acb l ' l (i i) A • ■ ■ A act K ' l (ix) 

where act kl : T k — » {True, FaZse} is derived from the model. 

• The effect of the occurrence of ei on a local state does not depend on the other local states: 

(4.2) new(ei,i[i iK ]) = (new M (n), . . . y new Kll (i K )) 

where new k ' 1 : T k — * T k is given by the model. 

• Finally, the rate of an active event e/ in state i[i,K] can be expressed as 

(4.3) rt(ei,i [liK] ) = A / • wgt l ' l (ix) • ■ • wgt K ' l {i K ) 


where A/ G JR -1 " is a constant “reference rate” and wgt kyl : T k — ► ]R + are (dimensionless) scaling 
weights, also given by the model. 

If the activation, effect, or rate of an event ej are not dependent on a local state i we let act k ' 1 = True , 
new k ' 1 = I, the identity function, and wgt k ' 1 = 1, respectively. 

We can then partition the set of events £. An event ei is local to we write e* € £ k , if its activation, 
rate, and effect are determined by, and limited to, the local state ik of rn^ alone: 

£^e, e £ : W ± k, act k ‘ ■' = True 

A new k ll = I A utgt k ' l = 1 1 . 


The events in 


K 

£• = S \ (J E k 

k= 1 


are instead said to be synchronizing. For notational convenience, these are numbered first: ei G £* iff 
1 <1<L = \£*\. 
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It has been shown in [9] (but see also [15, 16, 1,11] for more restrictive statements) that, when the 
conditions specified in (4.1), (4.2), and (4.3) hold, R can be expressed as 

K L K 

( 4 - 4 ) R = Rr,r = 0R + 

k=l 1=1 k=l 

where 

• R is a T xT matrix that can be described as the sum of Kronecker products and sums of much smaller 
matrices, and has the fundamental property that its entries in the rows and columns corresponding 
to states in T coincide in value with those of R. 

• is a n* x n k “weight” matrix defined as 

{ wgt k ' l (i) if act k ' l (i) — True 
and j — new k ' l (i) 

0 otherwise 

• R k are the local transition rate matrices describing the effect of local events: 

R* = X i‘ 

tnee k 

This description of R can then be used in iterative methods such as Jacobi, Gauss-Seidel, or their 
relaxation versions [3], avoiding the large storage requirements for R entirely. Only local matrices such as 
and R A need to be stored, and their memory requirements are small even for reasonably large K and L. 
The only remaining memory constraint is the storage of the iteration vector(s). However, these approaches 
based on Kronecker operators- still suffer from important limitations, which we now address. 

4.1. Implementations based on T. In the first implementations that appeared in the literature 
[1, 15, 16, 18], a vector 7f (E JR nj is used to store 7 r. In it, only the entries corresponding to states ip j £ T 
are actually used, while the remaining entries, corresponding to potential but not reachable states, are always 
zero. 

All Kronecker operations are performed ignoring the distinction between reachable and potential states 
(testing for zero can still be used to avoid some, but not all, useless computations). The main advantage of 
this approach is that, given a state i[i,/c]> the position of its corresponding entry in 7 r can be computed in 
O(K) operations, as Y,k=\ h * n k +v 

In addition, exploiting the structure of the matrix resulting from a Kronecker product, the vector-matrix 
product x • (<g)f =1 A k ) can be computed in 0(n f r}(A k )/n k ) operations, instead of 0{ nf =i ^(A*)). 

This result can be obtained by extending the Plateau-Stewart Theorem 9.1 of [20] to the case where the 
matrices A k are stored with an appropriate sparse format. 

Unfortunately the practical impact of this elegant result might not be as great as one could hope 
because the matrices involved are often “ultrasparse” , with r](A k ) at most slightly larger than n*. Assuming 
e nonzeros per row, the complexity is then 0{n?-Ke) versus 0(n ? e K ) for the straightforward multiplication, 
which is then a better choice whenever e < e^. For example, when e = 1, application of the Plateau- 
Stewart Theorem results in a complexity 0{n% ■ K), worse than the 0(nf ) complexity of the straightforward 
multiplication (of course, the memory savings might still justify using a Kronecker approach). 

4.2. Implementations based on T. Alternative implementations that operate directly on the vector 
7r, thus avoiding 0(n?) memory requirements, have been recently proposed [9, 3, 12]. This is an important 
advantage whenever |T| < |T|, often the case in practice. 
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However, these memory savings come with a price. When using T, a state z [1>K -] can be easily associated 
with its index in -ft, while now we must compute $(i[ liK j) to obtain the index in 7 T, using a logarithmic 
search in T. A simplistic implementation of the vector-matrix product 

x (g)A fc ^ 

k=l / 7",T 

has then complexity 



if performed “by rows”, as needed for a Jacobi-style iteration, or 



if performed “by columns” , as needed for the faster Gauss- Seidel-style iteration (the difference is that spurious 
entries from states in T\T to states in T might be encountered in the latter case). 

In [3], the logarithmic factor log |T| is reduced to logn* with the help of the data structure of Fig. 4.1, 
but it is not completely eliminated. 

5. Partitioning the local state spaces. We now demonstrate how an appropriate partition of the 
local state-spaces can achieve the advantages of the methods based on either the potential or the actual state 
space, without many of their drawbacks. 

Partition each local state space T k into P k classes 

P k - 1 

'jk _ q-k-v 

P = 0 

let Uk:j, — \T k:p \ be the number of states in class T k ‘ p , and define V k = {0, . . . , P k - 1} to be set of indices 
corresponding to the classes in the partition. We can then define a relation 7 Z C x£ =l V k over these sets of 
indices. Two interesting cases arise, which we denote as perfect and imperfect partition, respectively. 

5.1. Perfect partition. Assume that 1Z can be defined so that 

(5.1) p [hK] e 7Z=>T D x k=1 T k:pk and 

(5.2) Pl i,*] $ 7 Z^T n xf =1 T fc:p * = 0. 

Then, this means that the actual state space T can be described very compactly as the set of states contained 
in the cross-products of the classes corresponding to the elements of 1Z\ 

T= [J 

P[i,k)€K 

Note that, since the P k classes constitute a partition of T fc , these cross-products constitute a partition of T, 
that is, given two tuples P[i,k] an d differing in at least one component, the cross-products x^ =1 T fc:pfc 

and x£ =l T kuik contain disjoint sets of reachable states. 

Hence, we can encode the entire state space T using only a description of the local states for each 
submodel plus a boolean vector of size IIaLi to describe the relation TZ. We call this a “level-1” vector. 
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Submodel mj: 0 1 2 

/A\ /A\ /A\ 

Submodel m 2 : 012301230123 

iiiiiuimi 

Index; 0 1 23456789 10 II 


0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 


Fig. 5.1. An example of perfect partition . 


Of course, 71 can always be found, in the worst case by defining a partition of T k where each state is in 
a class by itself, but this results in 71 = T and the level- 1 vector has size |T|, as the one used by Kemper 
[12]. On the other hand, when T = T, (5.1) and (5.2) are trivially satisfied by defining a partition of T k 

into a single class: P k = 1 and T k = T k:0 . In this case, 71 = {(0, . . . , 0)}. 

The case of practical interest is when the actual state space T is a strict subset of the potential state 
space T, but 1 < P k < n k . Then, large savings can be achieved with respect to traditional techniques to 
store the state space [8] or to Kemper’s boolean vector. 

As an example, consider a model with two submodels, whose local state spaces can be partitioned as 
= ^~ 1,0 U T 1 1 U T 1 2 and T 2 = T 2 0 U T 2 1 U T 2 2 U T 2:3 , in such a way that, V(i 1} i 2 ) € T 1 x T 2 , 

(n, h) e T 4=>(n e T 1:0 A i 2 e T 2:1 u T 2:3 ) V 

(»i €T 1:1 Az 2 er 2:0 ur 2:3 ) V 
(*i € T 1:2 A i 2 e T 2:0 U T 2:1 U T 2:3 ) 

Then, the 12-element boolean vector shown in Fig. 5.1 can be used to encode the state space regardless of 
the sizes of the classes in the partitions (of course, the local state spaces for each submodel still need to be 
stored explicitly). 

To determine whether a given state «[i,Aq is reachable, we simply obtain each index p k of the partition 
P Pk containing the local state i kj we compute the mixed-base value of p^i k j, 

k k \ 

e w n p ')’ 

k=l l=k+l / 

and we check whether there is a “1” in the corresponding position of the boolean vector describing 71. The 
complexity of this test is then 0(K ), independent of the size of T. 

5.2. Imperfect partition. At times, the conditions in (5.1) and (5.2) might hold only for excessively 
fine partitions of the local state spaces, or maybe only in the limiting case P k — 7 i k , where each state is in a 
class by itself. To avoid large storage requirements for 7 Z, we might relax requirement (5.1), substituting it 
with 

(5-3) P[l K] e 7&T O xf =1 T fe:pfc ^ 0 

which, together with (5.2), simply results in 

(5-4) P[1K] e 7 ^nxf =1 r fc:pt / 0 

Then, for each ^ need to encode which states in ^j^ =z \P k are actually reachable. One way 

of doing this is to associate a u level-2” boolean vector of size Y[ k=l n k:pk to each element p [hK] e 71. This 
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Submodel mp 0 1 2 

/A\ /A\ /A\ 

Submodel m 2 : 012301230123 

uuiinuii 

Index: 0 1 23456789 10 II 


level- 1 


0 

0 

0 

0 

a 

0 

0 

0 

a 

a 

a 

0 




Submodel mp a b 

A\ A\ 

Submodel m 2 : f g h f g h 

ittm 

Index: 0 12 3 4 5 


level-2 


0 10 0 11 



Submodel mp c d e 

A\ A\ A\ 

Submodel m 2 : fghfghfgh 

m'mm 

Index: 012345678 


level-2 


• 

i 

i 

0 

I 1 1 0 1 







FlG. 5.2. An example of imperfect partition. 


approach is advantageous with respect to the full boolean vector employed by Kemper [12] provided that 
the overall memory required by the level-1 and level-2 boolean vectors is substantially less than |T| bits. A 
condition for this is that the level-1 vector used to encode 7Z has many “holes”, \7Z\ -ft- However, 

we must also require that these holes do not correspond only to small classes in the partition. 

This is apparent by considering the following extreme case: each local state space T k is partitioned into 
two classes, T k:0 with one state and T k:l with n* - 1 states, and 

n = .,o,i),(i,...,i)}. 

Then, \7Z\ = K + 1 <C = 2*. However, the storage for the boolean vector corresponding to the 

element (1, . . . , 1) of 71 would still require n*Li( n fc - 1) = 0(\T\) bits. 

Continuing our example, assume now that a “1” in the 12-element boolean vector simply means that 
some, not necessarily all, of the corresponding global states are reachable. Then, the two- level approach 
shown in Fig. 5.2 could be used. In the figure, only the level-2 boolean vectors corresponding to the level-1 
values (1,0) and (2,0) are shown, assuming that T 1:1 = {a, 6}, T 12 = {c,d, e}, and T 2:0 = {f,g,h}. For 
example, the entries in the level-2 boolean vector [010011] pointed by the level-1 value (1,0) tell us that only 
states (a,#), (b ) g), and (6, /t) are reachable among those in T 1:1 x T 2:0 . 

5.3. Comparison with alternative approaches. Two competitive alternatives proposed to store 
the state space T are a single boolean vector of size |T| [12] and the multilevel search data structure of 
Fig. 4.1, requiring little over |T|[Iogn/cl bits [8]. From a memory standpoint, the latter is superior when 
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|T|/|T| > [log ti k ] . However, it has higher execution overhead, log \T\ instead of A', when used to determine 
whether i[\,K) € T. 

In the perfect partition case (assuming non-trivial partitions), the approach we just introduced can 
use negligible amounts of memory while, in the imperfect partition case, it can have worst-case memory 
requirements similar to that of Kemper’s single boolean vector approach, although one would hope that it 
would perform much better than that in practice. In either case, the execution complexity to search for a 
state is only 0(K ), due to the direct access capabilities of the level-1 and level-2 boolean vectors. 

Indeed, one could simply consider our storage approach as a specialized way to compress a boolean vector 
of size |f| by eliminating entire sets of zero elements at a time. If, for a particular p [l>K] , the level-2 boolean 
vector happens to contain only l’s (if T I) *jf =l T k:pk ), we could treat this as a special case and avoid storing 
it altogether. We ignore this possibility, although it could be exploited in practical implementations. Clearly, 
the perfect partition case corresponds to the situation where none of the level-2 vector must be allocated 
explicitly. 

We also observe that there is no reason to limit the approach to two levels. The information conveyed 
by a level-2 boolean vector could be stored by further partitioning the classes of local states it corresponds 
to, thus introducing a level-3 vector, and so on. 

Ultimately, the only limiting factor in this process is the algorithm to decide whether and how to further 
partition a set of local states. Section 6 addresses this concern in practical situations. 

6. Determining a good partition. We can formulate the problem of finding the best perfect partition 
as a minimization problem. Given T 1 , . . . , T K and T C x£_ x T k , 

K 

Minimize n P k 

k = 1 

subject to \fk , 1 < A: < K, 

T k0 , . . . , 'T k Pk ~ 1 is a partition of T k , 

VPll./f) € x f=i{0 ,...,P k - 1}, 

(T o xf =i r fc:pfc ) V (T n x* =l T k: r- = 0) . 

Foi an imperfect partition, the quantity to be minimized must include the space for both the level-1 and 
the level-2 boolean vectors: 


K K 

(6-1) Minimize n p k + e n rik:p k 

*=i P[i.K]erc fc= 1 
subject to VA;, 1 < k < K, 

T k:0 , . . . ,T k:Pk ~ l is a partition of T k , 

n cxf =1 {o,...,A t -i}, 

VP[i,tf] € xf =1 {0, ...,P fc -l}, 

P[i.jf] € (T n xf =1 T fc:pt / 0) 

The second component of (6.1) is minimized when the number of “0” entries in the level-2 boolean 
vectors, 

(62) ( E II nfc: p*)-i T i 

\P[l,K]eTC k~ 1 f 
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Fig. 6.1. A simple fork/join model. 

is minimized. Indeed, for a perfect partition, (6.2) is zero. 

The search for the optimal partition is then equivalent to an integer programming problem, which is too 
complex for realistic applications. 

We then require a user-defined state-dependent partitioning function 

part k : T k — V k 

for each submodel m*., assuming that the decomposition of the model into K submodels is defined a priori. 
In most cases this is naturally given by a modular modeling approach, such as in top-down or bottom-up 
system design (see also [7] for a definition of a similar partitioning function used for a completely different 
purpose, the distributed generation of the state space). 

Then, the classes of the local state space partition T k:p are 

T k ' P = {ik I P = part k (i k )} 

and the resulting relation 7 Z is 

U {(p artl ( i i)> - >p ar<A: (*K))}- 

*[i,k]€T 


In other words, we try to find appropriate functions part k to achieve a (hopefully perfect) partition. For 
example, in the application presented in Section 9, we will use the total number of “tasks” within a submodel 
(i.e., its load) as the parameter defining our partition, so that only synchronizations between submodels can 
result in load changes in the involved submodels. An analogous definition is possible whenever a concept of 
load can be clearly defined at the submodel level. 

For GSPN models having an irreducible underlying Markov chain, this might simply reflect the existence 
of invariants in the model (sets of places for which a weighted sum of the number of tokens in them is a 
constant [13]). 

For example, consider the simple fork-and-join model of Fig. 6.1. Local states of submodel nik are 
expressed as triplets (#(afci)#(a* 2 )#(ajfe 3 )), where #(a) indicates the number of tokens in place a. The 
local state spaces for the two submodels are exactly the same, 

r 1 =r 2 = {(002), (on), (ioi), (020), (no), (2oo)}. 

However, not all n\ ■ ri 2 = 36 combinations of local states for m\ and m 2 are reachable. 
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In this GSPN, there are four p- invariants: 


hi '• #(fln) 4- #(a 12 ) 4- #(^ 13 ) = 2, 
h 2 * #(on) 4- #(ai 2 ) 4- #(a 2 3 ) = 2, 

J 2 i : #(o 2 i) 4- #(a 22 ) 4- #(ai 3 ) = 2, 

^22 : #(fl 2 i) 4- #(a 22 ) + #(^ 23 ) - 2 . 

Of these, 7 n and / 22 are “local” invariants, they only affect the reachable local states of m 1 and m 2 , 
respectively. The other two “global” invariants /i 2 and / 21 , though, link the reachable local states of mi 
and m 2 . 

I11 this simple case, the invariants tell us exactly which combinations are reachable: 

#(<*23) = 0 = 4 - #(an) + #(012) = 2 = 4 - #(<113) = 0, 

#( 023 ) = 1 =* #(ou) + #(ai 2 ) = 1 =*► #(ai 3 ) = 1, 

#( a 23) = 2 =4 #((lii) + #((*12) = 0 => #(0,13) = 2, 

hence, any global state (u,i 2 ) € T l x T 2 of the form ((#{«„), #(a 12 ), #(« 13 )), (#(a 21 ), #(a 22 ), #(a 23 ))) 

satisfying #(a 13 ) = #(a 23 ) is reachable (and no other is): 

T = {((002), (002)), ((Oil), (011)), ((Oil), (101)), 

((101), (011)), ((101), (101)), ((020), (020)), 

((020), (110)), ((020), (200)), ((110), (020)), 

((110), (110)), ((110), (200)), ((200), (020)), 

((200), (110)), ((200), (200)) }. 

Now, if we think of places dfc 3 as “idle” and a k \ and a k 2 as “busy”, we can define the load as the number 
of tokens in the busy places, and define the partitioning functions 

part k = #(a kl ) + #(a fe2 ) = 2 - #{a k3 ). 

This choice partitions the local state spaces T k into P k = 3 classes each, 

T k:0 = { ( 002 ) }, 

T k:1 = { (Oil), (101) }, and 

T k/2 = { ( 020 ), ( 110 ), ( 200 ) }, 

Given our choice of partition for the local state spaces, we obtain 7 Z = {(0,0), (1, 1), (2, 2)}, resulting in 
a perfect partition. 

We should note that our choice of local partitioning functions is such that part k : T k — * V k , for eacli 
submodel l < k < K y is invariant with respect to the occurrence of any local event: 

Vei € £\ Vi* 6 T\ 

(6-3) jk = new k,l (ik) =r> part k (i k ) = par £*(7*) 

while synchronizing events always cause at least one partitioning function to change value: 

Vc/€f, V» [liJf] €T, 

( 6 - 4 ) J[i,/c] = riei^(e£ , «[!,/c]) => 3 A;, part fc (u) / part k (j k ) 
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7. Block-partition induced on R. The partitions of the local state spaces just, introduced can be 
used to decompose the matrices R and R into blocks B corresponding to the sets of states identified by the 
tuples P[\,k] e 7 Z. 

7.1. Perfect partition. In this section, we assume that the user-defined partitioning functions satisfy 
constraints (6.3) and (6.4), In other words, we assume all and only synchronizing transitions change the 
load parameter of the submodels. This does not imply a loss of generality, since we can always treat a local 
transition as if it were a synchronizing one. Thus, the matrices R fc and W fc -' can be block-partitioned as 

R^" 0 0 0 

0 R A 1 0 

0 0 H k ‘ p k-i 

and 

W fc ’* :0 ’° Yfkj:0,l ... W M:0,P fc - 1 

yjk,l:P k -l,0 ... yyk,l:P k 

respectively, where 

• The off-diagonal blocks in R k are zero due to constraint (6.3). 

• R fc:p = R-t ■k,„ T k:,. and 

• wMtP.p' _ 

Just as we defined a lexicographic order V I J on 7~, we can define one on the tuples in 1Z: 

n : 7e — {o, . . . , |7e| - 1 }. 

Then, the transition matrix R can be rewritten in a block-structured fashion as 

B° 0 0 

OB 1 0 

0 0 Bl 7 ^- 1 

0 B 0 ' 1 B 0 ’^!- 1 

B 1 ’ 0 0 Bhl K l-i 

BW-to Bl7t|-i,i ... o 

where the blocks are defined as 

K L K 

• B 1 ’ = 0 R fc:p * 4- ^2 A, (g) , 

A:=l /= 1 k— 1 

L K 

/=i k = i 

and b — and b f = Note that the off-diagonal blocks of the first term in the right-hand- 

side of Eq. 7.1 are zero because of the structure of R A , while the diagonal blocks of the second term are 
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zero because of constraint (6.4). If our assumptions were violated, we simply would have to allow for the 
possibility that these blocks might be nonzero instead. Note also that rectangular matrices are involved in 
the Kronecker product of matrices unlike Eq. 4.4, which uses only square matrices. However, 

this does not impose any restriction or complication on the solution algorithm. 

Continuing with our fork-and-join example of Fig. 6.1, the blocks of the transition matrices R* and W kJ 
are 


Then, 


R k0 = [0], R fc:1 


0 0 
0 


R k 2 


0 0 0 
A* 0 0 

0 AA; 0 


W M:0 ’1 = [ 0 1 ] , W *’ 0:1 ’ 2 


0 1 0 
0 0 1 


\jyrk, 3:1,0 _ 

1 

, yyrA:,3:2,l _ 

‘ 1 0 ■ 
0 1 


0 


_ ° 0 


R = 


B° B° 1 0 

B 1 0 B 1 B 12 

0 B 2 i B 2 


(7.2) 


’o 

0 

0 

0 

Ao 

0 

0 

0 

0 

0 

0 

0 

0 

0 

A 3 

o"~ 

0 

T 

0 

o'" 

0 

0 

0 

Ao 

0 

~o~ 

0 

0 

0 

A 2 

0 

0 

0 

0 

0 

0 

0 

0 

Ao 

0 

0 

0 

0 

Ai 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

Ac 

0 

0 

0 

Ax 

A 2 

0 

0 

0 

0 

0 

0 

0 

0 

0 

Ao 

0 

A 3 

0 

0 

0 

r 

0 

0 

0 

0 

0 

T 

0 

0 

0 

0 

A 3 

0 

0 

A 2 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

A 2 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

A 3 

0 

Ai 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

a 3 

0 

Ai 

0 

a 2 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

Ai 

0 

a 2 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

Ai 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

Ai 

0 

a 2 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

Ax 

0 

a 2 

0 


As we can easily check, all 14 reachable states of T are properly addressed and no unreachable state is 
created by the Kronecker products within the blocks of matrix R. 

We also observe that the block-tridiagonal form of R is due to the fact that, in our case, synchronizing 
events can only decrement or increment the value of the local partitioning functions by one, a fairly common 
occurrence that could be further exploited in implementations of our method. 
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7.2. Imperfect partition. In the case of imperfect partitioning, the blocks for the transition matrix 
R in Eq. (7.1) must be restricted to the sets of reachable states: 

# ( B )rnx fc K =i r fc: n 1 Tnxf = 1 T* : "/i:’ 

\ / rn x 1 r k t rn x * =1 r 'V- 

with b = n(p[i,/c]) and b' = K |), as before. 

Given again the fork-and-join model of Fig. 6 . 1 , assume that we partition eacli local state space 
k = 1,2 into Pfc = 2 classes as: 


r h{] = {(002), (on), (ioi)} 
r k:l = {(020), (110), (200)} 


This partition is imperfect since T k:0 includes local states with different local loads (values of #( 0 * 3 ) for 
submodel ra^), while it is necessary to impose the global constraint #(«i 3 ) — #(a 23 ) to achieve perfect 
partitioning (this is clearly not a good choice, given the existence of a perfect partition, but it is useful for 
illustrative purposes). 

The blocks of the transition matrices H k and for k = 1,2, are then: 



0 0 Au ' 


1 

o 

o 

0 

1 

r/ 0 = 

1 

o o 
° 

© 

, R fc 1 = 

A fc 0 0 

.040 


" 0 

0 

1 ' 



0 

0 ’ 

0 

0 

0 

\yfc,3:0,0 _ 

1 

0 

0 

0 

0 

0 _ 


_ 0 

0 

0 _ 

" 0 

0 

0 ■ 


" 0 

1 

0 ■ 

0 

1 

0 

.u _ 

0 

0 

1 

0 

0 

1 


0 

0 

0 


With the new choice of partition for the local state spaces, we obtain 71 = {(0,0), ( 1 , 1 )}. Then, if we 
restrict the blocks of matrix 

B° B° 1 
B 10 B 1 

to the reachable states only, we obtain the same matrix R as in Eq. (7.2), but with coarser blocks. More 
precisely, the original blocks define 9x9 matrices, which are reduced to a 5 x 5 matrix for block B°, a 5 x 9 
matrix for B 01 , and a 9 x 5 matrix for block B 1,0 , while B 1 remains a 9 x 9 matrix. Even if imperfect, this 
partition is still more memory-efficient than working directly on the potential state space of size |T| = 36. 

8 . Equations for the solution of the CTMC. We discuss first the perfect partition case. An 
alternative way of writing Eq. 3.1 in a block-structured way is 


n h ■ (B 1 ’ - diag( h 1 ')" 1 ) + ^ tt'*' ■ B b ’' h = 0 


and 
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where 7r - [tt 0 , . . . , *] and h - [h°, . . . , h l7J| *] are the corresponding block-partition of 7 r and of h, the 

vector of expected holding times, and, again, b = n(p [1>K] ) and b' = n(p' [1K] ). Assuming that the diagonal 
of R, hence of every block B'', is zero, h satisfies 


diag( h) = rmTn(R) 1 . 

Its blocks h h 6 ^?l Tnx *=i T * I, 6 = 0,..., 1 72, | - 1, can be computed as 


h° = B b 


i + y, bM/ • 1 


These blocks can be explicitly stored to speed up the computation, instead of recomputing them at each 
iteration. 


Since even the blocks tend to be very large for practical modeling problems, indirect iterative numeri- 
cal methods based on this partition should be applied. In general, such block-iterative methods require more 
computation time per iteration, but they exhibit a good rate of convergence. The iterative method starts 
with an initial guess tt[ 0] and computes successive vectors 7 r[m], until the desired convergence is reached. 
The Gauss-Seidel method can then be written in a blockwise manner as: 


V6 = o, . . . , \n\ - I 5 t r V Hh 1] (tt V] ■ B*+ 


7 T b '[m 4- 1] • B l/ ' b + J2 


6'=0 6 ' = 6+1 
or, for the block Gauss-Seidel method with overrelaxation (SOR) m: 

TT h [m +1] (1 — uj)n l '[m,} +uj ■ ^7r 4 [m] ■ B 4 + 


fc-l 


| K |-1 


7r ' / [»» + 1] • B 4 '- 6 + jh IT 6 ' M ■ B 4 '’ 4 ) • h'’. 

6'=0 6 ' = 6+1 

Foi the choice of the relaxation parameter u/, 0 < w < 2, which affects the convergence rate, we refer to [20]. 
The analogous equations for the Jacobi method are 


7T 6 [m 4- 1] 


W-i 

(V'H B 4 + • B 4 '’ 4 ) ■ h 4 


b '=0 


and, if relaxation is used, 


7V h [m *f 1] (1 - a;)7r /j [m] 4- {jj • ^n b [m] • B 6 4- 


\n\-i 


w 6 'H ■ B 4 ,! H • h 4 . 


fc'= 0 


For an imperfect partition, elements of the partial probability vectors 7 r\ b = 0, . . . , \7l\ - 1, have to 
be additionally addressed by the corresponding index functions 

V b : Tfl x^ 1 T fc:pfc — {0, . . . , |T n xff =l T k:pk \ - 1}. 
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Fig. 9.1. A model of a tasking system . 

The index functions impose a lexicographic order. 

In other words, for an imperfect partition, we must store the level-2 boolean vectors and use them to test 
whether certain states are reachable. However, if we accept an iteration vector n larger than tv (but smaller 
than the fr for the entire potential state space), we can simply consider all the states described by the level- 1 
boolean vector; some of these are reachable and some are not, since we ignore the level-2 information (indeed, 
we do not even have to store the level-2 vectors). The approach is similar to the original one, suggested by 
Plateau-Stewart [20], but uses only a subset f of the potential state space: T C T C T. 

We can then use the formulas for the perfect partitioning and iterate disregarding whether the states 
corresponding to a pp ^ € 7Z are reachable or not. If, as in Plateau-Stewart, we ensure that any entry 
con esponding to an umeachable state is set to zero in the initial probability vector 7 r^[ 0 ], the nonzero 
entries of the final iteration vector 7r*’[m] after convergence will be the probabilities of the reachable states. 
For our fork-and-join model of Fig. 6.1, this results in a CTMC with 18 states, more than the required 14, 
but still much less than the potential 36 states. 

9. Example and timing results. For the numerical computation we consider the distributed tasking 
system described by the GSPN shown in Fig. 9.1. 

It consists of S server tasks and two classes of customer tasks, C\ of class 1 and C 2 of class 2. Five 
submodels ra*, k = 1, ... ,5, indicated by dashed boxes, interact through the four synchronizing transitions 
shown in grey, representing the beginning (t h i and t h2 ) and the end (t, cA and t. v2 ) of a rendezvous between 
a server task and a task of class 1 or 2, respectively. Tasks of each class run their local computation 
independently, until they need to synchronize with a server task. If no server task is ready to synchronize 
with them, they simply wait until one becomes ready. 

The parameters specifying the stochastic behavior of this software system are the rates of each transition 
in the GSPN. We assume that the weight of every transition is constant, this implies that each transition 
represents an independent hardware resource with single-server semantics, that is, without internal paral- 
lelism (e.g., a single processing unit). Hence, we only need to specify \ x for each transition t, x . More complex 
dependencies can be accommodated by our model: 

• If, for each submodel ra^, t^i and tf, 2 share the same hardware resource, this could be easily accom- 
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ino dated, since both transitions are local to the same model rn k . Only the entries of R^ would be 
affected. 

• If a transition in submodel m k represents a parallel hardware resource, this can also be reflected in 

matrix R^\ by appropriately scaling the rates according to the load on that transition in each local 
state. 

However, the state- dependent behavior across submodels is more limited: 

• The rate of each synchronizing transition can only depend in a “product-form” fashion on the local 
states of the involved submodels. For example, the rate of t {) x can be of the form \ bl • /(#(pi 3 )) ■ 
0(#(P43))- This cannot describe the case where the work performed by to initiate all the wait- 
ing rendezvous (there are min{#(pi3), #(p4 3 )} of them) can be performed in parallel on multiple 
processors. 

• Analogously, if the servers run on a single processor and the rendezvous actions are performed on 
them, submodels rn u m 2 , and m 3 must be merged into a single submodel m 123 . Only then we can 
coriectly represent the processor-sharing interactions between all the transitions using the processor, 
in the local matrix for this larger submodel, R 123 . 

• It should be noted that, in either case, the shared use of resource needed by the synchronizing 
transitions is still not represented correctly, since, again, we cannot express it in product-form 
fashion. For example, the model cannot represent the fact that, when #(pi 3 ) = s, #(p 43 ) = c u 
and #(^53) = t>2, there are min{s, c\ -f - c 2 } rendezvous initiations sharing the same processor. The 
net effect is that the total rate of t b i and t ()2 will be twice what it should be in any global state 
where they are both enabled. However, the timing of these synchronizing transitions should be 
much faster than that of the local transitions in practical models. Indeed, it might be appropriate 
to use immediate transitions to model the synchronizations in our model. We have shown how the 
Kroneckei -based approach can accommodate such immediate synchronizations in [9]. 

The numerical values of the rates of the synchronizing transitions are Xu = Af )2 = A*.i = A r2 = 10.0, 
while the rates of the local transitions are X kl = A Ar2 = k for k = 1,2, 3, 4, 5. We assume that there are N 
tasks of each type, N = C\ = C 2 = S. 

In Table 9.1, we give the expected number E k of tasks in each submodel as a function of the initial 
numbei of tokens N . In particular, tokens in submodels m 2 and m 3 represent tasks of class 1 and 2 in 
rendezvous with a server, respectively. We also compute the throughput r of the system, defined as the 
expected firing rate of any local transition in mi, that is, the rate of completion of server tasks. 

We now comment on the memory and execution complexity of our approach. We consider two different 
structured configuration of the tasking system. 

• MOD 5 , a model consisting of five submodels mi, m 2 , m 3 , m 4 , and m 5 . 

• MOD 3 , a model consisting of three submodels: a larger submodel rn V23 obtained by merging sub- 
models mi, m 2 , and m 3 , plus the submodels m 4 and m 5 . 

We compute the steady-state solution of MOD5 using the conventional structured approach based on the 
actual state space T (CS“ ct ), the “potential” state space T (CS^ ot ) } or our approach based on a perfect 
partition (PP 5 ). For MOD 3 , we use either the conventional structured approach, again based on the actual 
or potential state space T (CS 3 r:f and CS£ oi ), or our imperfect partition approach (IP 3 ) based on T. For PP 5 , 
we partition the local state space of submodels m kj k = 1 , . . . , 5 according to their population, through the 
function part = #(p A i) 4- #(pa. 2 ) 4- #(pa: 3)- For IP 3 , we partition the local state space of submodels m 4 and 
m s as f°r PP.5^ while that of submodel mi 23 is partitioned according to part 123 = #(/>n) 4 #(pi 2 ) 4 #{pis)- 
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N 

E 1 

£2 

E 3 

e 4 

E ., 

T 

1 

0.688 

0.182 

0.130 

0.818 

0.870 

0.335 

2 

1.466 

0.316 

0.218 

1.684 

1.782 

0.539 

3 

2.312 

0.412 

0.276 

2.588 

2.724 

0.663 

4 

3.207 

0.478 

0.315 

3.522 

3.685 

0.741 

5 

4.134 

0.524 

0.342 

4.476 

4.658 

0.792 

6 

5.083 

0.557 

0.360 

5.443 

5.640 

0.827 

7 

6.046 

0.580 

0.374 

6.420 

6.626 

0.853 

8 

7.019 

0.598 

0.384 

7.402 

7.616 

0.872 

9 

8.001 

0.613 

0.390 

8.387 

8.610 

0.886 


Table 0.1 


Expected population in each submodel mid throughput. 


N 

|T fc | 

m 

1*1 

1 

4 

45 

1,024 

2 

10 

693 

100,000 

3 

20 

6,060 

3,200,000 

4 

35 

36,981 

52,521,875 

5 

56 

175,383 

550,731,776 

6 

84 

689,325 

4.182-10 9 

7 

120 

2,341,404 

2.488 10 10 

8 

165 

7,074,990 

1.222 10 11 

9 

220 

19,421,038 

5.15410 11 


Table 0.2 

Size of the state space for MOD& (k = 1 , ... 7 5 / 


The global //-invariants of the tasking system are (no local //-invariants exist): 

3 

Y #(Pkl) + #(p* 2 ) + #{'Pks) = S 

k=l 

Y + #(Pfc2) + #(p*3) = Cl 

fc=2,4 

#^1) + #(pfc 2 ) + #^3) = c 2 . 

Ar=3,5 

The perfect partition of model PP 5 distinguishes among the local loads #(pki) + #(p* 2 ) + #(pks) for each 
submodel m* : ; this naturally correspond to the enforcement of the global //-invariants, whereas partition IP3 
fails to do so, because the loads #(p 2 i) + #(^22) + #(/>23) and #(^31) + #(7/32) + #(7/33) cannot be uniquely 
determined given a class in the partition of submodel m 12 3 . 

In the following, we compare the complexity of the six solutions CSg c * t CS£"*, PP 5 , CSg crf , CSJ"*, and 
IP.i' The size of the local state spaces T* and of the global state spaces T and T are given for configuration 
MOD 5 in Table 9.2. In the tasking model, the two rendezvous create unreachable states (any marking where 
the sum of tokens in in rn 1 . ra 2 , and m 3 is not equal to S), so that \T\ < |T|. In addition, the total 
population of m 2 and m 4 must equal C\ , and the total population of m 3 and iri^ must equal C 2 . Hence, any 
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N 

|T 123 | 

m 

m 

m 

1 

9 

45 

63 

144 

2 

45 

693 

1,305 

4,500 

3 

165 

6,060 

14,504 

66,000 

4 

495 

36,981 

107,811 

606,375 

5 

1,287 

175,383 

603,855 

4,036,032 

6 

3,003 

689,325 

2,739,443 

21,189,168 

7 

6,435 

2,341,404 

10,552,950 

92,664,000 

8 

12,870 

7,074,990 

42,742,692 

3.5038- 10 8 

9 

24,310 

19,421,038 

108,301,458 

1.1766 10 9 


Table 9.3 

Size of the state space for MOD$ (k = 4, 5,1. 


N 

nz 3 

l*ls 

11Z5 

Ms 

p(R) 

1 

18 

3 

32 

3 

90 

2 

122 

6 

88 

6 

2,376 

3 

530 

10 

220 

10 

27,432 

4 

1,810 

15 

440 

15 

199,932 

5 

5,230 

21 

770 

21 

1,075,158 

6 

13,150 

28 

1,232 

28 

4,644,900 

7 

30,702 

36 

1,848 

36 

16,991,520 

8 

65,310 

45 

2,640 

45 

54,513,576 

9 

129,360 

55 

3,630 

55 

157,241,916 

Table 9.4 


M cmory requirements . 

population vector not fulfilling these global p-invariants is unreachable as well, another reason for having 
|T| < |T|. Analogously, the size of the local state spaces T 123 and of the global state spaces T, T , and T. 
are given for configuration MOD 3 in Table 9.3 (T 4 and T 5 are as in Table 9.2). 

The overall number of nonzero elements for the sparse storage of the matrices R* and W*' J are given 
in Table 9.4 for configurations MOD 3 and MOD 5 , under the column headings nz 3 and nz 5 , respectively, 
together with the size of TZ in the two cases. For comparison, we also list the number of nonzero elements in 
R, which would have to be explicitly generated and stored for conventional unstructured solution methods. 
One can see that a conventional solution not based on structured methods would fail for N > 8 because of 
the memory required to store R explicitly. 

Instead, the memory requirements for the local matrices are negligible with respect to the storage of 
the iteration vectors. The main practical limitation of the structured approach is then memory for these 
vectors 1 and for the state space (needed only by CSg rt and CSf‘). The iteration vectors are of size |T| 
foi CS 5 , CS 3 , and PP 5 ; of size |T| for CS 5 and CS 1 ’”* (but MOD 5 and MOD 3 have different potential 
state spaces); and of size |T| for IP 3 . Hence, the memory requirements for these vectors exceed the available 

‘Two iteration vectors, ir[m] and 7r[m + 1], are needed for the Jacobi method or for the Gauss- Seidel method when the 
stopping criterion is based on a relative or absolute comparison of two successive iteration vectors. Only one vector is instead 
needed with Gauss-Seidel if we use the norm of the residual jr[m] (R - nusm( R)) as a criterion. 
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TV 

CSf' 

CSf* 

PPs 

1 

0.117 (131) 

0.107 (31) 

0.050 (117) 

2 

0.800 ( 57) 

0.700 (57) 

0.633 ( 62) 

3 

11.917 ( 86) 

10.533 (86) 

7.917 ( 77) 

4 

114.967 (126) 

-(-) 

77.867 (114) 

5 

821.517 (171) 

-(-) 

551.733 (155) 

6 

4,282.368 (219) 

-(-) 

3,097.833 (199) 

7 

-( -) 

-(-) 

13,153.000 (245) 

8 

-( -) 

-(-) 

47,021.100 (294) 

9 

-( -) 

-(-) 

157,960.610 (346) 


CSf‘ 

CS?"'' 

IPs 

0.083 (131) 

0.063 (131) 

0.050 (117) 

0.567 ( 57) 

0.503 ( 57) 

0.450 ( 62) 

8.167 ( 86) 

7.150 ( 86) 

6.267 ( 77) 

82.450 (126) 

72.833 (126) 

55.400 (114) 

620.717 (171) 525.900 (171) 

408.550 (155) 

3,345.683 (219) 

-( -) 

2,241.867 (199) 

14,721.867 (270) 

— ( — ) 10,121.867 (245) 

52,324.993 (325) 

-( -) 

-( -) 

-( -) 

-( -) 

-( -) 


Table 9.5 


Time (iterations) for the numerical solution as a function of N. 


memory and cause the solution to fail first for CSg°* (when TV > 4), then for CS£ o£ (when TV > 6), and 
finally for IP 3 (when TV > 8), while CS§ C * and PP 5 can be solved up to TV = 9. However, CSg* experiences 
excessive paging due to the additional storage for the actual state space 7~, so we provide data only for 
PP5. I11 principle, CS£ ct could be solved also for TV = 9. but our prototype implementation uses four- bytes 
(unsigned) integers to store a potential state to be searched, so it fails when \T\ > 2 32 = 4.494 • 10 9 , which 
happens for TV > 7, in Table 9.2. 

The overall solution time has two components: the time to explore the reachability set (only CSg cf and 
CS3 requiie this step) and the time for the numerical solution. Since the former is negligible in comparison 
to the latter, Table 9.5 reports only the numerical solution time, in seconds, for the six approaches. We 
use the Jacobi method with a relaxation parameter 0.9 for the conventional structured methods, or a block- 
Jacobi method for PP 5 and IP3, with the same relaxation parameter. Table 9.5 also lists the number of 
iterations required for the solution. In the conventional structured case, these number are the same across 
the foui models, since they differ only in the algorithmic implementation, but they otherwise perform exactly 
the same floating point operations in the same order on the relevant entries of the iteration vectors. Using 
a block Jacobi method for PP 5 and IP 3 models, both the number of outer iterations and of inner iterations 
would be relevant. However, for a fair comparison with the conventional structured solution, we limit the 
numbei of inner iterations to one. Increasing this number would reduce the number of outer iterations and 
result, in most cases, in a substantial speedup. The optimization of the block Jacobi method by varying 
the number of inner iterations, the relaxation, the order in which blocks are considered, or using modified 
adaptive methods are beyond the scope of this paper. 

The uniform distribution is the initial guess for 7r[0], and the relative convergence criterion 


max 

iGT 


KW - +ik[ \ < 10 -3 

*r[m 4 - l]i J 


is used to stop the iterations. The program is run on a workstation with a 400 Mhz DEC Alpha 21164 
processor and 256 MByte of main memory. In all cases, the vector h is stored explicitly and all vectors are 
stored in double precision. The memory requirements would be approximately halved had we used single 
precision instead. 

As we observed before, IP 3 cannot be solved for TV > 8 because of the size of the state space |T|, given 
in Table 9.3. However, when it can be run, IP 3 is the fastest method (Table 9.5). It is faster than PP 5 , since 
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Kronecker products of only three matrices, instead of five, are required, and it is faster than CS 3 , because 
no logarithmic search to compute the index of reachable states is necessary. For N > 8, PP 5 is the fastest 
method, and for TV > 9 it is the only feasible method. However, using the multilevel data structure of [8] 
for the storage of the actual state space T instead of the unsigned integer vector we used in our prototype 
implementation, CS| rf and CSg rt would also be feasible. 

For the perfect partition PP 5 and for the imperfect partition IP 3 using iteration vectors of size |7j, or for 
the conventional structured methods based on the potential states space, CS% ot and CSg"‘, it is not necessary 
to explore and store the global state space. Instead, the local state spaces T k are explored and partitioned 
subject to the partitioning function, and the relation 71 is determined exploring the global state space of an 
aggregated stochastic automatic network that fulfills all global p-invariants. In other special cases (e.g., the 
SGSPNs [10, 11]), the submodels are restricted in such a way that the local state spaces t k can be obtained 
in isolation, thus avoiding the reachability set exploration altogether, but the resulting T k might then be a 
strict superset of T k . In full generality, though, none of these approaches might be possible, and the local 
state spaces T k must be obtained by projection of the global state space on the Ar-th component. In this 
case, we can delete the data structure used to store T as soon as the sets T k have been computed. For CS™* 
and CS ;:l , we must instead keep T throughout the numerical solution, to compute the indexing function ^ 

We stiess that, to achieve a fair comparison of the various approaches, we used the Jacobi method in all 
cases, since it effectively allows us to ignore the rows and columns of R corresponding to the unreachable 
states. This is essential for CS?"*, CS 3 °* , and IP 3 (if based on T). However, any approach based on the 
actual state space can also use a SOR-type iteration, which has faster convergence, although the cost per 
iteration can be somewhat larger in this case, if spurious entries in R t\r,r need to be recognized and ignored 
explicitly. Numerical experiments show that the overall solution time is then generally lower for CSg ct , CS 3 rf , 
and IP 3 (if based instead on T), but certainly lower for a (block) SOR applied to PP 5 , which does not suffer 
from the problem of unreachable states. 

10. Conclusion. We presented a new approach for the Kronecker-based analysis of structured continuous- 
time Markov models. By partitioning the “local state spaces” of each submodel, we are aide to restrict the 
description of the “potential” transition rate matrix R to the “actual” transition rate matrix R correspond- 
ing to the reachable states only, without having to incur a logarithmic overhead. Indeed, the partition also 
allows us to avoid storing the state space altogether, thus reducing the peak memory requirements. 

In addition to the more straightforward “perfect” partition, we also introduced the concept of an “im- 
perfect” partition, where the resulting state space is larger than the actual state space, but still, hopefully, 
much smaller than the potential one. Surprisingly, this might result in the shortest solution time, as long as 
the computation fits in memory. 

One substantial advantage of our approach is that it can naturally employ a block SOR method for the 
numerical solution, which has then a faster convergence rate than the Jacobi method traditionally used in 
previous Kronecker-based approaches. 

Oui algoi ithms aie implemented in the tool SNS [23]. For a copy of the program please contact the 
second author. 
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